Below we show the data dictionary we will work with. It corresponds to the data obtained after modifying the original dataset.
Canada's federal, provincial and territorial governments work to improve road safety in order to reduce the number of fatalities.
Taking into account all the available variables, we want to predict whether there will be fatalities or not. The model will be evaluated once the accident has occurred; that is, we will evaluate it using only the variables known at the instant a collision takes place. Fields known only after the fact will not be included in the model. The goal of the model is to predict, with the highest possible accuracy, whether at the moment an accident occurs there will be at least one fatality or whether everyone involved survived.
To do so, we have a dataset provided by Transport Canada's National Collision Database (NCDB), which contains data on all reported motor vehicle collisions in Canada, supplied by the provinces and territories each year (1999-2014).
! conda info
active environment : ML_big_practice
active env location : /Users/aitor/opt/anaconda3/envs/ML_big_practice
shell level : 2
user config file : /Users/aitor/.condarc
populated config files : /Users/aitor/.condarc
conda version : 4.10.3
conda-build version : 3.21.4
python version : 3.8.8.final.0
virtual packages : __osx=10.16=0
__unix=0=0
__archspec=1=x86_64
base environment : /Users/aitor/opt/anaconda3 (writable)
conda av data dir : /Users/aitor/opt/anaconda3/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/osx-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/osx-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /Users/aitor/opt/anaconda3/pkgs
/Users/aitor/.conda/pkgs
envs directories : /Users/aitor/opt/anaconda3/envs
/Users/aitor/.conda/envs
platform : osx-64
user-agent : conda/4.10.3 requests/2.26.0 CPython/3.8.8 Darwin/20.4.0 OSX/10.16
UID:GID : 501:20
netrc file : None
offline mode : False
We import the libraries needed for this notebook and adjust the display options so that every column of the dataset is visible.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
import missingno as msno
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import pyarrow.parquet as pq
import pyarrow as pa
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 5000)
We load FUNCIONES. This imports into the notebook the functions defined in the FUNCIONES.py file, where all the helper functions created to reach our goal are kept, each with its corresponding explanation.
%run FUNCIONES
We read the data provided in .csv format and display the first 8 rows of the dataset for a first look.
df_pd_data = pd.read_csv("/Users/aitor/Desktop/Máster Ciencia de Datos/Aprendizaje automático/Machine_Learning/big_practice_data/NCDB_1999_to_2014.csv")
df_pd_data.head(8)
/Users/aitor/opt/anaconda3/envs/ML_big_practice/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3444: DtypeWarning: Columns (1,2,5,12) have mixed types. Specify dtype option on import or set low_memory=False. exec(code_obj, self.user_global_ns, self.user_ns)
| C_YEAR | C_MNTH | C_WDAY | C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | C_RALN | C_TRAF | V_ID | V_TYPE | V_YEAR | P_ID | P_SEX | P_AGE | P_PSN | P_ISEV | P_SAFE | P_USER | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1999 | 1 | 1 | 20 | 2 | 02 | 34 | UU | 1 | 5 | 3 | 03 | 01 | 06 | 1990 | 01 | M | 41 | 11 | 1 | UU | 1 |
| 1 | 1999 | 1 | 1 | 20 | 2 | 02 | 34 | UU | 1 | 5 | 3 | 03 | 02 | 01 | 1987 | 01 | M | 19 | 11 | 1 | UU | 1 |
| 2 | 1999 | 1 | 1 | 20 | 2 | 02 | 34 | UU | 1 | 5 | 3 | 03 | 02 | 01 | 1987 | 02 | F | 20 | 13 | 2 | 02 | 2 |
| 3 | 1999 | 1 | 1 | 08 | 2 | 01 | 01 | UU | 5 | 3 | 6 | 18 | 01 | 01 | 1986 | 01 | M | 46 | 11 | 1 | UU | 1 |
| 4 | 1999 | 1 | 1 | 08 | 2 | 01 | 01 | UU | 5 | 3 | 6 | 18 | 99 | NN | NNNN | 01 | M | 05 | 99 | 2 | UU | 3 |
| 5 | 1999 | 1 | 1 | 17 | 2 | 03 | 1 | 2 | 1 | 01 | 01 | 01 | 1984 | 01 | M | 28 | 11 | 1 | UU | 1 | ||
| 6 | 1999 | 1 | 1 | 17 | 2 | 03 | 1 | 2 | 1 | 01 | 02 | 01 | 1991 | 01 | M | 21 | 11 | 1 | UU | 1 | ||
| 7 | 1999 | 1 | 1 | 17 | 2 | 03 | 1 | 2 | 1 | 01 | 02 | 01 | 1991 | 02 | F | UU | 13 | 2 | UU | 2 |
We inspect the types of the variables in our dataset.
df_pd_data.dtypes
C_YEAR     int64
C_MNTH    object
C_WDAY    object
C_HOUR    object
C_SEV      int64
C_VEHS    object
C_CONF    object
C_RCFG    object
C_WTHR    object
C_RSUR    object
C_RALN    object
C_TRAF    object
V_ID      object
V_TYPE    object
V_YEAR    object
P_ID      object
P_SEX     object
P_AGE     object
P_PSN     object
P_ISEV    object
P_SAFE    object
P_USER    object
dtype: object
We display the dimensions of the dataset we are going to work with.
print("The dimensions of the dataset we are going to analyse are", df_pd_data.shape[0], "x", df_pd_data.shape[1])
The dimensions of the dataset we are going to analyse are 5860405 x 22
We check the unique values in each column using the valores_unicos function. If a variable has more than 'k' unique categories, they are not shown, so we pick a high 'k' to display everything.
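The helper lives in FUNCIONES.py, which is not shown in this notebook. A minimal sketch of what valores_unicos might look like, assuming it simply maps each column to its unique values and skips columns above the 'k' threshold (hypothetical reimplementation; the real one may differ):

```python
import pandas as pd

def valores_unicos(df: pd.DataFrame, k: int = 150) -> dict:
    """Map each column to the list of its unique values.

    Columns with more than k unique categories are omitted.
    (Hypothetical sketch of the FUNCIONES.py helper.)
    """
    return {
        col: list(df[col].unique())
        for col in df.columns
        if df[col].nunique(dropna=False) <= k
    }
```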
valores_unicos(df_pd_data, k=150)
{'C_YEAR': [1999,
2000,
2001,
2002,
2003,
2004,
2005,
2006,
2007,
2008,
2009,
2010,
2011,
2012,
2013,
2014],
'C_MNTH': [1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
'12',
'UU',
'01',
'02',
'11'],
'C_WDAY': [1, 2, 3, 4, 5, 6, 7, '7', 'U', '1', '2', '3', '4', '5', '6'],
'C_HOUR': ['20',
'08',
'17',
'15',
'14',
'01',
'11',
'13',
'19',
'16',
'09',
'02',
'18',
'12',
'10',
'23',
'00',
'06',
'07',
'21',
'UU',
'05',
'22',
'03',
'04'],
'C_SEV': [2, 1],
'C_VEHS': ['02',
'01',
'03',
'04',
'06',
'07',
'09',
'UU',
'05',
'13',
'08',
'12',
'14',
'10',
'11',
'16',
1,
2,
3,
4,
5,
6,
8,
7,
12,
10,
'26',
'71',
19,
25,
9,
'21',
'27',
'15',
'35',
15,
13,
26,
'22',
'41',
27,
46,
31,
11,
'18',
18,
56,
'23',
36,
17,
14,
20,
21,
'19',
16,
'29',
'25',
77,
28,
38,
'38',
'17',
'32',
35,
22,
33,
'54',
72,
'40',
44,
58,
'30',
'77',
'20',
'24',
'34',
24,
nan,
39,
'51',
57,
43,
37],
'C_CONF': ['34',
'01',
'QQ',
'04',
'31',
'21',
'23',
'03',
'02',
'33',
'UU',
'24',
'35',
'41',
'06',
'32',
'36',
'05',
'22',
'25'],
'C_RCFG': ['UU',
'QQ',
'01',
'02',
'03',
'05',
'04',
'06',
'08',
'07',
'09',
'10'],
'C_WTHR': ['1', '5', '3', '4', '7', '2', 'U', '6', 'Q'],
'C_RSUR': ['5', '3', '2', '4', '1', '6', 'U', 'Q', '7', '9', '8'],
'C_RALN': ['3', '6', '1', 'U', '2', '5', '4', 'Q'],
'C_TRAF': ['03',
'18',
'01',
'UU',
'06',
'10',
'05',
'04',
'11',
'QQ',
'07',
'08',
'16',
'17',
'02',
'13',
'15',
'09',
'12'],
'V_ID': ['01',
'02',
'99',
'03',
'04',
'UU',
'05',
'06',
'07',
'08',
'09',
'10',
'11',
'12',
'13',
'14',
'15',
'16',
'17',
'18',
'19',
'20',
'21',
'22',
'23',
'24',
'25',
'26',
'27',
'28',
'29',
'30',
'31',
'32',
'33',
'34',
'35',
'36',
'37',
'38',
'39',
'40',
'41',
'42',
'43',
'44',
'45',
'46',
'47',
'48',
'49',
'50',
'51',
'52',
'53',
'54',
'55',
'56',
'57',
'58',
'59',
'60',
'61',
'62',
'63',
'64',
'65',
'66',
'67',
'68',
'69',
'70',
'71',
2,
1,
99,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,
30,
31,
32,
33,
34,
35,
36,
37,
38,
39,
40,
41,
'72',
'73',
'74',
'83',
'85',
'86',
'75',
'76',
'77',
42,
43,
44,
45,
46,
47,
48,
49,
51,
52,
50,
53,
54,
55,
56,
57],
'V_TYPE': ['06',
'01',
'NN',
'11',
'UU',
'20',
'17',
'07',
'08',
'QQ',
'09',
'22',
'14',
'23',
'05',
'16',
'19',
'18',
'10',
'21'],
'V_YEAR': ['1990',
'1987',
'1986',
'NNNN',
'1984',
'1991',
'1992',
'1997',
'1993',
'1985',
'1988',
'1994',
'1995',
'1998',
'1989',
'UUUU',
'1996',
'1983',
'1999',
'1965',
'1977',
'1978',
'1968',
'1981',
'1979',
'1976',
'1972',
'2000',
'1982',
'1975',
'1973',
'1974',
'1980',
'1967',
'1970',
'1971',
'1962',
'1969',
'1966',
'1945',
'1963',
'1960',
'1950',
'1964',
'1959',
'1955',
'1958',
'1903',
'1909',
'1949',
'1923',
'1961',
'1914',
'1908',
'1953',
'1906',
'1939',
'1925',
'1948',
'1938',
'1907',
'1952',
'1904',
'1917',
'1912',
'1944',
'1956',
'1930',
'1931',
'1951',
'1946',
'1947',
'1957',
'1954',
'1943',
'1901',
'1937',
'1905',
'1935',
'1926',
'1941',
'1932',
'1920',
'1933',
'1919',
'1915',
'1929',
'1928',
'2001',
'1913',
'1940',
'1927',
'2002',
'1916',
'1942',
'1918',
'2003',
'1924',
'1922',
'1934',
'2004',
'2005',
'2006',
'2007',
'2008',
'1911',
'2009',
'2010',
'2011',
'1936',
'2012',
'1910',
'1921',
'2013',
'2014',
'2015'],
'P_ID': ['01',
'02',
'03',
'04',
'05',
'06',
'NN',
'07',
'08',
'09',
'10',
'11',
'12',
'13',
'14',
'15',
'16',
'17',
'18',
'19',
'20',
'21',
'22',
'23',
'24',
'25',
'26',
'27',
'28',
'29',
'30',
'31',
'32',
'33',
'34',
'35',
'36',
'37',
'38',
'39',
'40',
'41',
'42',
'43',
'44',
'45',
'46',
'47',
'48',
'49',
'50',
'51',
'52',
'53',
'54',
'55',
'56',
'57',
'58',
'59',
'60',
'61',
'62',
'63',
'64',
'65',
'66',
'67',
'68',
'69',
'70',
'71',
'72',
'73',
'74',
'75',
'76',
'77',
'78',
'79',
'80',
'81',
'82',
'83',
'84',
'85',
'86',
'87',
'88',
'89',
'90',
'91',
'92',
'93',
'94',
'95',
'UU',
'99'],
'P_SEX': ['M', 'F', 'U', 'N'],
'P_AGE': ['41',
'19',
'20',
'46',
'05',
'28',
'21',
'UU',
'61',
'56',
'34',
'22',
'30',
'49',
'32',
'31',
'68',
'08',
'45',
'17',
'33',
'82',
'39',
'37',
'55',
'38',
'43',
'35',
'23',
'25',
'65',
'44',
'36',
'70',
'50',
'40',
'27',
'26',
'15',
'53',
'16',
'13',
'14',
'12',
'18',
'77',
'86',
'42',
'24',
'47',
'62',
'06',
'57',
'83',
'74',
'67',
'51',
'29',
'01',
'02',
'54',
'71',
'10',
'79',
'63',
'58',
'48',
'60',
'07',
'64',
'75',
'52',
'85',
'93',
'92',
'69',
'72',
'11',
'59',
'09',
'66',
'76',
'73',
'04',
'78',
'80',
'84',
'03',
'81',
'89',
'87',
'88',
'90',
'NN',
'91',
'95',
'97',
'94',
'99',
'98',
'96'],
'P_PSN': ['11',
'13',
'99',
'23',
'98',
'21',
'22',
'12',
'QQ',
'96',
'32',
'UU',
'NN',
'31',
'33',
'97'],
'P_ISEV': ['1', '2', '3', 'U', 'N'],
'P_SAFE': ['UU', '02', 'NN', '01', '13', '12', '09', 'QQ', '10', '11'],
'P_USER': ['1', '2', '3', 'U', '4', '5']}
We check for duplicated rows in the data.
sum(df_pd_data.duplicated())
5069
After this quick look at the dataset, the modifications we will carry out are the following:
First, as explained above, our goal is to predict whether there are fatalities in an accident or not, regardless of the possible number of deaths. For that, we want a single row per accident, avoiding unnecessary rows that only differ in values which add no relevant information. We filter the dataset by the variable P_PSN, which indicates the person's position in the accident, keeping the value "11", which corresponds to the driver's position. We choose this position because at least one driver (whatever the vehicle) is involved in every collision.
df_pd_data = df_pd_data[df_pd_data["P_PSN"]=="11"]
df_pd_data = df_pd_data.replace(["UU", "U", "UUU", "UUUU"], np.nan)
Now let's see how many non-null values each of our columns has, using 'msno.bar'.
msno.bar(df_pd_data, filter="Top", sort = 'ascending', color = 'orange')
<AxesSubplot:>
Next, we look at the percentage of missing values in the columns we will work with.
np.round(df_pd_data.isnull().sum().sort_values(ascending=False)/df_pd_data.shape[0], 5)
P_SAFE    0.11292
C_RCFG    0.08528
C_RALN    0.07083
V_YEAR    0.06608
P_AGE     0.05239
P_SEX     0.03973
C_TRAF    0.03842
C_CONF    0.03332
P_USER    0.01667
C_WTHR    0.01475
C_RSUR    0.01346
C_HOUR    0.01031
P_ISEV    0.00807
V_TYPE    0.00657
C_WDAY    0.00023
C_VEHS    0.00009
C_MNTH    0.00005
V_ID      0.00002
P_ID      0.00000
C_SEV     0.00000
P_PSN     0.00000
C_YEAR    0.00000
dtype: float64
The largest share of missing values, in column P_SAFE, does not exceed 12%.
Next we call the unificar function to modify our dataset, merging "duplicated" representations of the same value (e.g. 12 and "12" become a single value) and converting the selected columns to numeric types. From now on, the values of the remaining categorical columns will take the form "01", "02", and so on.
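unificar is also defined in FUNCIONES.py and not shown here. A plausible sketch, assuming it coerces the selected columns to pandas' nullable Int64 type so that 12 and "12" collapse into the same numeric value:

```python
import pandas as pd

def unificar(df: pd.DataFrame, columnas: list) -> pd.DataFrame:
    """Coerce the given columns to nullable integers.

    Both 12 and "12" become the integer 12; non-numeric values
    become <NA>. (Hypothetical sketch of the FUNCIONES.py helper.)
    """
    df = df.copy()
    for col in columnas:
        df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")
    return df
```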
df_pd_data = unificar(df_pd_data, ["C_MNTH", "C_WDAY", "C_HOUR", "C_VEHS", "V_YEAR", "P_AGE"])
We check the new number of duplicated rows.
sum(df_pd_data.duplicated())
3642
Indeed there are duplicates, so we remove those rows with drop_duplicates()
df_pd_data = df_pd_data.drop_duplicates()
df_pd_data = df_pd_data.reset_index(drop=True)
df_pd_data
| C_YEAR | C_MNTH | C_WDAY | C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | C_RALN | C_TRAF | V_ID | V_TYPE | V_YEAR | P_ID | P_SEX | P_AGE | P_PSN | P_ISEV | P_SAFE | P_USER | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1999 | 1 | 1 | 20 | 2 | 2 | 34 | NaN | 01 | 05 | 03 | 03 | 01 | 06 | 1990 | 01 | M | 41 | 11 | 01 | NaN | 01 |
| 1 | 1999 | 1 | 1 | 20 | 2 | 2 | 34 | NaN | 01 | 05 | 03 | 03 | 02 | 01 | 1987 | 01 | M | 19 | 11 | 01 | NaN | 01 |
| 2 | 1999 | 1 | 1 | 8 | 2 | 1 | 01 | NaN | 05 | 03 | 06 | 18 | 01 | 01 | 1986 | 01 | M | 46 | 11 | 01 | NaN | 01 |
| 3 | 1999 | 1 | 1 | 17 | 2 | 3 | 01 | 02 | 01 | 01 | 01 | 01 | 1984 | 01 | M | 28 | 11 | 01 | NaN | 01 | ||
| 4 | 1999 | 1 | 1 | 17 | 2 | 3 | 01 | 02 | 01 | 01 | 02 | 01 | 1991 | 01 | M | 21 | 11 | 01 | NaN | 01 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3922439 | 2014 | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 10 | 01 | <NA> | 01 | F | 79 | 11 | 01 | NN | 01 |
| 3922440 | 2014 | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 11 | 01 | <NA> | 01 | F | 20 | 11 | 01 | NN | 01 |
| 3922441 | 2014 | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 12 | 01 | <NA> | 01 | F | 47 | 11 | 01 | NN | 01 |
| 3922442 | 2014 | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 13 | 07 | <NA> | 01 | M | 24 | 11 | 01 | NN | 01 |
| 3922443 | 2014 | <NA> | <NA> | 14 | 2 | 1 | 02 | 01 | 01 | 05 | 04 | 18 | 01 | 01 | 2006 | 01 | F | <NA> | 11 | 02 | 01 | 01 |
3922444 rows × 22 columns
The new dimensions of the dataset are:
print("The dimensions of the dataset we are going to analyse are", df_pd_data.shape[0], "x", df_pd_data.shape[1])
The dimensions of the dataset we are going to analyse are 3922444 x 22
We call valores_unicos again and verify that there are no longer duplicated representations in our columns.
valores_unicos(df_pd_data, k=150)
{'C_YEAR': [1999,
2000,
2001,
2002,
2003,
2004,
2005,
2006,
2007,
2008,
2009,
2010,
2011,
2012,
2013,
2014],
'C_MNTH': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, <NA>],
'C_WDAY': [1, 2, 3, 4, 5, 6, 7, <NA>],
'C_HOUR': [20,
8,
17,
15,
14,
1,
11,
13,
19,
16,
9,
2,
18,
12,
10,
23,
0,
6,
7,
21,
<NA>,
5,
22,
3,
4],
'C_SEV': [2, 1],
'C_VEHS': [2,
1,
3,
4,
6,
7,
9,
5,
<NA>,
13,
8,
12,
14,
10,
11,
16,
26,
71,
19,
25,
21,
27,
15,
35,
22,
41,
46,
31,
18,
56,
23,
36,
17,
20,
29,
77,
28,
38,
32,
33,
54,
72,
40,
44,
58,
30,
24,
34,
39,
51,
57,
43,
37],
'C_CONF': ['34',
'01',
'QQ',
'04',
'31',
'21',
'23',
'03',
'02',
'33',
nan,
'24',
'35',
'41',
'06',
'32',
'36',
'05',
'22',
'25'],
'C_RCFG': [nan,
'QQ',
'01',
'02',
'03',
'05',
'04',
'06',
'08',
'07',
'09',
'10'],
'C_WTHR': ['01', '05', '03', '04', '07', '02', nan, '06', 'Q'],
'C_RSUR': ['05', '03', '02', '04', '01', '06', nan, 'Q', '07', '09', '08'],
'C_RALN': ['03', '06', '01', nan, '02', '05', '04', 'Q'],
'C_TRAF': ['03',
'18',
'01',
nan,
'06',
'10',
'05',
'04',
'11',
'QQ',
'07',
'08',
'16',
'17',
'02',
'13',
'15',
'09',
'12'],
'V_ID': ['01',
'02',
'03',
'04',
'05',
'06',
'07',
nan,
'08',
'09',
'10',
'11',
'12',
'13',
'14',
'15',
'16',
'17',
'19',
'20',
'21',
'22',
'23',
'24',
'25',
'26',
'18',
'27',
'28',
'29',
'30',
'31',
'32',
'33',
'34',
'35',
'36',
'37',
'38',
'39',
'40',
'41',
'42',
'43',
'44',
'45',
'46',
'47',
'48',
'49',
'50',
'51',
'52',
'53',
'54',
'55',
'56',
'57',
'58',
'59',
'60',
'61',
'62',
'63',
'64',
'65',
'66',
'67',
'68',
'69',
'70',
'71',
'72',
'73',
'74',
'83',
'85',
'86',
'99',
'75',
'76',
'77'],
'V_TYPE': ['06',
'01',
'11',
nan,
'20',
'17',
'07',
'08',
'QQ',
'09',
'22',
'14',
'23',
'05',
'16',
'18',
'19',
'NN',
'10',
'21'],
'V_YEAR': [1990,
1987,
1986,
1984,
1991,
1992,
1997,
1993,
1985,
1988,
1994,
1995,
1998,
1989,
<NA>,
1996,
1983,
1999,
1965,
1977,
1978,
1968,
1981,
1979,
1976,
1972,
2000,
1982,
1975,
1973,
1974,
1980,
1967,
1970,
1971,
1962,
1969,
1966,
1945,
1963,
1960,
1950,
1964,
1959,
1955,
1958,
1903,
1909,
1949,
1923,
1961,
1914,
1908,
1953,
1906,
1939,
1925,
1948,
1938,
1907,
1952,
1904,
1917,
1912,
1944,
1956,
1930,
1931,
1951,
1946,
1947,
1957,
1954,
1943,
1901,
1937,
1905,
1935,
1926,
1941,
1932,
1920,
1933,
1919,
1915,
1929,
1928,
2001,
1913,
1940,
1927,
2002,
1916,
1942,
1918,
2003,
1924,
1922,
1934,
2004,
2005,
2006,
2007,
2008,
1911,
2009,
2010,
2011,
1936,
2012,
1910,
1921,
2013,
2014,
2015],
'P_ID': ['01',
'02',
'03',
'04',
'05',
'06',
'08',
'07',
'10',
'09',
'11',
'12',
'15',
'16',
'17',
'19',
'20',
'21',
'22',
'24',
'25',
'27',
'28',
'30',
'32',
'33',
'34',
'13',
'14',
'26',
'18',
'23',
'29',
'31',
'36',
'38',
'39',
'40',
'41',
'42',
'43',
'45',
'47',
'49',
'50',
'52',
nan],
'P_SEX': ['M', 'F', nan, 'N'],
'P_AGE': [41,
19,
46,
28,
21,
<NA>,
61,
34,
22,
30,
49,
32,
31,
68,
45,
17,
33,
82,
37,
55,
43,
35,
23,
25,
65,
44,
36,
70,
50,
27,
39,
15,
53,
16,
26,
77,
86,
42,
24,
47,
20,
62,
18,
38,
57,
83,
51,
29,
54,
71,
79,
58,
67,
48,
60,
64,
75,
52,
85,
40,
69,
74,
59,
56,
76,
66,
80,
78,
9,
72,
73,
84,
63,
81,
89,
87,
90,
11,
93,
88,
14,
5,
10,
6,
1,
95,
91,
12,
13,
2,
7,
92,
8,
4,
98,
94,
96,
99,
3,
97],
'P_PSN': ['11'],
'P_ISEV': ['01', '02', '03', nan, '0N'],
'P_SAFE': [nan, '02', '01', '13', 'NN', '12', '09', 'QQ', '10', '11'],
'P_USER': ['01', nan, '04', '05', '03']}
We display the variable types again to verify that they were changed correctly.
df_pd_data.dtypes
C_YEAR     int64
C_MNTH     Int64
C_WDAY     Int64
C_HOUR     Int64
C_SEV      int64
C_VEHS     Int64
C_CONF    object
C_RCFG    object
C_WTHR    object
C_RSUR    object
C_RALN    object
C_TRAF    object
V_ID      object
V_TYPE    object
V_YEAR     Int64
P_ID      object
P_SEX     object
P_AGE      Int64
P_PSN     object
P_ISEV    object
P_SAFE    object
P_USER    object
dtype: object
We will create a helper variable used only to support a decision; it will not be part of the analysis. This variable, diferencia, indicates how many years passed between the vehicle's model year (V_YEAR) and the accident. If any difference is negative, the vehicle's model year is later than the accident year.
df_pd_data["diferencia"] = df_pd_data["C_YEAR"]-df_pd_data["V_YEAR"]
To see whether there are negative values in the diferencia column, we use the valores_unicos function.
valores_unicos(df_pd_data[["diferencia"]], k=150)
{'diferencia': [9,
12,
13,
15,
8,
7,
2,
6,
14,
11,
5,
4,
1,
10,
<NA>,
3,
16,
0,
34,
22,
21,
31,
18,
20,
23,
27,
-1,
17,
24,
26,
25,
19,
32,
29,
28,
37,
30,
33,
54,
36,
39,
49,
35,
40,
44,
41,
96,
90,
50,
76,
38,
85,
91,
46,
93,
60,
74,
51,
61,
92,
47,
95,
82,
87,
55,
43,
69,
68,
48,
53,
52,
42,
45,
56,
98,
62,
94,
64,
73,
58,
67,
79,
66,
80,
84,
70,
71,
83,
86,
75,
-2,
59,
65,
63,
77,
72,
81,
57,
100,
99,
101,
78,
89,
97,
88,
103]}
Indeed, there are negative values (-1 and -2), so we remove the rows in which the collision year is earlier than the vehicle's model year. To do so, we build a loop that walks through the positions of df_pd_data["diferencia"]; whenever it finds a value that is negative and not null, it appends the row index to a list. We then pass that list to .drop to delete all the rows where this happens.
list_indexes=list()
for i in range(len(df_pd_data["diferencia"])):
if not pd.isnull(df_pd_data["diferencia"][i]) and df_pd_data["diferencia"][i]<0:
list_indexes.append(i)
Let's see how many rows we are going to remove.
print(len(list_indexes))
20330
We remove both the flagged rows and the diferencia column, which is no longer needed for the analysis.
df_pd_data = df_pd_data.drop(labels=list_indexes, axis=0)
df_pd_data = df_pd_data.drop(labels= ["diferencia"], axis=1)
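As an aside, the same row removal can be done without an explicit Python loop, via boolean indexing. A self-contained sketch on toy data (the frame and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame mimicking the relevant columns (illustrative values only)
df = pd.DataFrame({
    "C_YEAR": [1999, 2000, 2001, 2002],
    "V_YEAR": pd.array([1998, 2003, pd.NA, 2000], dtype="Int64"),
})
df["diferencia"] = df["C_YEAR"] - df["V_YEAR"]

# Drop rows with a negative difference; null differences are kept,
# matching the loop's behaviour of skipping null entries.
mask = (df["diferencia"] < 0).fillna(False)
df = df[~mask].drop(columns="diferencia")
```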
The idea behind modifying these variables is to make them cyclic, i.e. so that weekday 7 and weekday 1 are close to each other mathematically speaking. If we do not do this, we are asserting that Sunday is greater than Monday, which makes no sense. We therefore use the cosine and sine functions to encode the mentioned variables cyclically, by calling the codificacion_ciclica function.
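codificacion_ciclica is defined in FUNCIONES.py and not shown here. A sketch consistent with the cos/sin values visible in the output table (month 1 maps to cos ≈ 0.866, sin = 0.5, i.e. a period of 12), assuming the period is taken as the column maximum; the real implementation may differ:

```python
import numpy as np
import pandas as pd

def codificacion_ciclica(df: pd.DataFrame, columnas: list) -> pd.DataFrame:
    """Add cos_<col> and sin_<col> columns for each cyclic variable.

    Hypothetical sketch: each value is mapped onto a circle whose
    period is the column maximum (12 for months, 7 for weekdays).
    """
    df = df.copy()
    for col in columnas:
        periodo = df[col].max()
        angulo = 2 * np.pi * df[col] / periodo
        df[f"cos_{col}"] = np.cos(angulo)
        df[f"sin_{col}"] = np.sin(angulo)
    return df
```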
df_pd_data = codificacion_ciclica(df_pd_data, ["C_MNTH", "C_WDAY", "C_HOUR"])
Six new float columns, the sine and cosine of each of the chosen variables, are appended to our dataset.
df_pd_data
| C_YEAR | C_MNTH | C_WDAY | C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | C_RALN | C_TRAF | V_ID | V_TYPE | V_YEAR | P_ID | P_SEX | P_AGE | P_PSN | P_ISEV | P_SAFE | P_USER | cos_C_MNTH | sin_C_MNTH | cos_C_WDAY | sin_C_WDAY | cos_C_HOUR | sin_C_HOUR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1999 | 1 | 1 | 20 | 2 | 2 | 34 | NaN | 01 | 05 | 03 | 03 | 01 | 06 | 1990 | 01 | M | 41 | 11 | 01 | NaN | 01 | 0.866025 | 0.5 | 0.62349 | 0.781831 | 0.682553 | -0.730836 |
| 1 | 1999 | 1 | 1 | 20 | 2 | 2 | 34 | NaN | 01 | 05 | 03 | 03 | 02 | 01 | 1987 | 01 | M | 19 | 11 | 01 | NaN | 01 | 0.866025 | 0.5 | 0.62349 | 0.781831 | 0.682553 | -0.730836 |
| 2 | 1999 | 1 | 1 | 8 | 2 | 1 | 01 | NaN | 05 | 03 | 06 | 18 | 01 | 01 | 1986 | 01 | M | 46 | 11 | 01 | NaN | 01 | 0.866025 | 0.5 | 0.62349 | 0.781831 | -0.57668 | 0.81697 |
| 3 | 1999 | 1 | 1 | 17 | 2 | 3 | 01 | 02 | 01 | 01 | 01 | 01 | 1984 | 01 | M | 28 | 11 | 01 | NaN | 01 | 0.866025 | 0.5 | 0.62349 | 0.781831 | -0.068242 | -0.997669 | ||
| 4 | 1999 | 1 | 1 | 17 | 2 | 3 | 01 | 02 | 01 | 01 | 02 | 01 | 1991 | 01 | M | 21 | 11 | 01 | NaN | 01 | 0.866025 | 0.5 | 0.62349 | 0.781831 | -0.068242 | -0.997669 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3922439 | 2014 | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 10 | 01 | <NA> | 01 | F | 79 | 11 | 01 | NN | 01 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
| 3922440 | 2014 | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 11 | 01 | <NA> | 01 | F | 20 | 11 | 01 | NN | 01 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
| 3922441 | 2014 | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 12 | 01 | <NA> | 01 | F | 47 | 11 | 01 | NN | 01 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
| 3922442 | 2014 | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 13 | 07 | <NA> | 01 | M | 24 | 11 | 01 | NN | 01 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
| 3922443 | 2014 | <NA> | <NA> | 14 | 2 | 1 | 02 | 01 | 01 | 05 | 04 | 18 | 01 | 01 | 2006 | 01 | F | <NA> | 11 | 02 | 01 | 01 | <NA> | <NA> | <NA> | <NA> | -0.775711 | -0.631088 |
3902114 rows × 28 columns
Below we show an example of the encoding of the month variable (C_MNTH) with a sample of 40 rows. The plot clearly shows the cyclic shape of the variable.
df_selected = df_pd_data[["cos_C_MNTH","sin_C_MNTH"]].dropna().sample(40).sort_index()
display(df_selected)
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_selected["cos_C_MNTH"], y=df_selected["sin_C_MNTH"], mode="markers"))
fig.update_layout(xaxis = dict(title="cos_x"),
yaxis = dict(title="sin_x", scaleanchor = "x", scaleratio = 1))
fig.show()
| cos_C_MNTH | sin_C_MNTH | |
|---|---|---|
| 157793 | -0.5 | -0.866025 |
| 289487 | 0.866025 | 0.5 |
| 428266 | -0.866025 | -0.5 |
| 489011 | 0.5 | -0.866025 |
| 671319 | -1.0 | 0.0 |
| 812707 | 1.0 | -0.0 |
| 829080 | 0.866025 | 0.5 |
| 940023 | -1.0 | 0.0 |
| 965222 | -0.866025 | -0.5 |
| 1179840 | -0.5 | 0.866025 |
| 1267940 | -0.5 | -0.866025 |
| 1283965 | -0.5 | -0.866025 |
| 1297148 | -0.0 | -1.0 |
| 1343926 | 0.866025 | -0.5 |
| 1383726 | 0.866025 | 0.5 |
| 1484853 | -1.0 | 0.0 |
| 1686943 | 0.0 | 1.0 |
| 1704418 | 0.0 | 1.0 |
| 1734254 | -0.866025 | 0.5 |
| 1752162 | -1.0 | 0.0 |
| 1865454 | 0.866025 | -0.5 |
| 1909667 | 0.866025 | 0.5 |
| 2265507 | -1.0 | 0.0 |
| 2288815 | -0.866025 | -0.5 |
| 2333203 | -0.0 | -1.0 |
| 2337981 | -0.0 | -1.0 |
| 2404508 | 1.0 | -0.0 |
| 2421937 | 0.866025 | 0.5 |
| 2449400 | 0.0 | 1.0 |
| 2794261 | -0.0 | -1.0 |
| 2800198 | -0.0 | -1.0 |
| 2895568 | 0.0 | 1.0 |
| 3066311 | 1.0 | -0.0 |
| 3210988 | -0.5 | -0.866025 |
| 3246784 | 0.5 | -0.866025 |
| 3321242 | 0.5 | 0.866025 |
| 3375639 | -1.0 | 0.0 |
| 3379347 | -1.0 | 0.0 |
| 3841962 | -0.5 | -0.866025 |
| 3890410 | 0.866025 | -0.5 |
We show how many missing values each column has after all the modifications, verifying that missing values are still present in our data.
df_dtypes = pd.merge(df_pd_data.isnull().sum(axis = 0).sort_values().to_frame('missing_value').reset_index(),
df_pd_data.dtypes.to_frame('feature_type').reset_index(),
on = 'index',
how = 'inner')
df_dtypes
| index | missing_value | feature_type | |
|---|---|---|---|
| 0 | C_YEAR | 0 | int64 |
| 1 | C_SEV | 0 | int64 |
| 2 | P_PSN | 0 | object |
| 3 | P_ID | 2 | object |
| 4 | V_ID | 70 | object |
| 5 | C_MNTH | 208 | Int64 |
| 6 | sin_C_MNTH | 208 | Float64 |
| 7 | cos_C_MNTH | 208 | Float64 |
| 8 | C_VEHS | 340 | Int64 |
| 9 | C_WDAY | 897 | Int64 |
| 10 | sin_C_WDAY | 897 | Float64 |
| 11 | cos_C_WDAY | 897 | Float64 |
| 12 | V_TYPE | 25716 | object |
| 13 | P_ISEV | 31514 | object |
| 14 | cos_C_HOUR | 40283 | Float64 |
| 15 | C_HOUR | 40283 | Int64 |
| 16 | sin_C_HOUR | 40283 | Float64 |
| 17 | C_RSUR | 52612 | object |
| 18 | C_WTHR | 57633 | object |
| 19 | P_USER | 65153 | object |
| 20 | C_CONF | 130053 | object |
| 21 | C_TRAF | 150234 | object |
| 22 | P_SEX | 154981 | object |
| 23 | P_AGE | 207814 | Int64 |
| 24 | C_RALN | 276722 | object |
| 25 | V_YEAR | 298915 | Int64 |
| 26 | C_RCFG | 333342 | object |
| 27 | P_SAFE | 440895 | object |
We show a descriptive summary of all the variables in the dataset. With few numeric variables and many missing values, it will not provide much information.
df_pd_data.describe(include = "all")
| C_YEAR | C_MNTH | C_WDAY | C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | C_RALN | C_TRAF | V_ID | V_TYPE | V_YEAR | P_ID | P_SEX | P_AGE | P_PSN | P_ISEV | P_SAFE | P_USER | cos_C_MNTH | sin_C_MNTH | cos_C_WDAY | sin_C_WDAY | cos_C_HOUR | sin_C_HOUR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.902114e+06 | 3.901906e+06 | 3.901217e+06 | 3.861831e+06 | 3.902114e+06 | 3.901774e+06 | 3772061 | 3568772 | 3844481 | 3849502 | 3625392 | 3751880 | 3902044 | 3876398 | 3.603199e+06 | 3902112 | 3747133 | 3.694300e+06 | 3902114 | 3870600 | 3461219 | 3836961 | 3.901906e+06 | 3.901906e+06 | 3.901217e+06 | 3.901217e+06 | 3.861831e+06 | 3.861831e+06 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | 19 | 11 | 8 | 10 | 7 | 18 | 81 | 19 | NaN | 46 | 3 | NaN | 1 | 4 | 9 | 4 | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | NaN | NaN | NaN | NaN | 21 | 02 | 01 | 01 | 01 | 18 | 01 | 01 | NaN | 01 | M | NaN | 11 | 02 | 02 | 01 | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | 1218544 | 1832204 | 2710935 | 2566589 | 2786471 | 2045240 | 2172228 | 3296856 | NaN | 3875621 | 2339978 | NaN | 3902114 | 1952355 | 2778877 | 3635934 | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 2.006007e+03 | 6.682391e+00 | 3.909504e+00 | 1.355508e+01 | 1.985429e+00 | 2.096262e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.998483e+03 | NaN | NaN | 3.956105e+01 | NaN | NaN | NaN | NaN | 7.984994e-04 | -5.393182e-02 | -6.780699e-02 | -2.582404e-02 | -2.573130e-01 | -2.815464e-01 |
| std | 4.568234e+00 | 3.437303e+00 | 1.905761e+00 | 5.113093e+00 | 1.198270e-01 | 1.308572e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.121161e+00 | NaN | NaN | 1.643964e+01 | NaN | NaN | NaN | NaN | 7.167983e-01 | 6.951915e-01 | 6.854874e-01 | 7.244602e-01 | 6.352853e-01 | 6.715166e-01 |
| min | 1.999000e+03 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.901000e+03 | NaN | NaN | 1.000000e+00 | NaN | NaN | NaN | NaN | -1.000000e+00 | -1.000000e+00 | -9.009689e-01 | -9.749279e-01 | -9.906859e-01 | -9.976688e-01 |
| 25% | 2.002000e+03 | 4.000000e+00 | 2.000000e+00 | 1.000000e+01 | 2.000000e+00 | 2.000000e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.994000e+03 | NaN | NaN | 2.600000e+01 | NaN | NaN | NaN | NaN | -8.660254e-01 | -8.660254e-01 | -9.009689e-01 | -7.818315e-01 | -7.757113e-01 | -8.878852e-01 |
| 50% | 2.006000e+03 | 7.000000e+00 | 4.000000e+00 | 1.400000e+01 | 2.000000e+00 | 2.000000e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.999000e+03 | NaN | NaN | 3.800000e+01 | NaN | NaN | NaN | NaN | -1.836970e-16 | -2.449294e-16 | -2.225209e-01 | -2.449294e-16 | -3.348796e-01 | -5.195840e-01 |
| 75% | 2.010000e+03 | 1.000000e+01 | 5.000000e+00 | 1.700000e+01 | 2.000000e+00 | 2.000000e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.004000e+03 | NaN | NaN | 5.100000e+01 | NaN | NaN | NaN | NaN | 8.660254e-01 | 5.000000e-01 | 6.234898e-01 | 7.818315e-01 | 2.034560e-01 | 2.697968e-01 |
| max | 2.014000e+03 | 1.200000e+01 | 7.000000e+00 | 2.300000e+01 | 2.000000e+00 | 7.700000e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.014000e+03 | NaN | NaN | 9.900000e+01 | NaN | NaN | NaN | NaN | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 9.749279e-01 | 1.000000e+00 | 9.976688e-01 |
Given the large number of null values, most of the summary statistics are NaN.
We look at the distribution of the target variable, building a data frame with the count and the percentage of each class of the target.
pd_plot_csev = df_pd_data["C_SEV"]\
.value_counts(normalize=True)\
.mul(100).rename("Porcentaje").reset_index()
pd_plot_csev_conteo = df_pd_data["C_SEV"].value_counts().reset_index()
pd_plot_csev_pc = pd.merge(pd_plot_csev, pd_plot_csev_conteo, on =["index"], how="inner")
pd_plot_csev_pc
| index | Porcentaje | C_SEV | |
|---|---|---|---|
| 0 | 2 | 98.542918 | 3845257 |
| 1 | 1 | 1.457082 | 56857 |
We plot the distribution of the target variable.
fig = px.histogram(pd_plot_csev_pc, x="index", y=["Porcentaje"])
fig.show()
As we can see, more than 98% of the values of our target variable belong to category 2 (no fatalities). We will have to take this distribution into account when building the model, since the dataset is imbalanced.
Distribution of the target variable against other variables in the dataset.
We plot the distribution of our target variable against the variables we consider most relevant for the analysis.
C_SEV ~ C_WDAY
pd_data_copia = df_pd_data.copy()
pd_data_copia['fatal'] = np.where(pd_data_copia['C_SEV']==1, 1, 0)
pd_data_copia['non_fatal'] = np.where(pd_data_copia['C_SEV']==2, 1, 0)
df1 = pd_data_copia.groupby('C_WDAY')[['fatal', 'non_fatal']].sum()
df1
| C_WDAY | fatal | non_fatal |
|---|---|---|
| 1 | 7173 | 527826 |
| 2 | 7169 | 563694 |
| 3 | 7192 | 570421 |
| 4 | 8227 | 597245 |
| 5 | 9669 | 661317 |
| 6 | 9330 | 514215 |
| 7 | 8093 | 409646 |
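Because the raw counts are dominated by the class imbalance, a per-group fatality rate makes the day-of-week comparison clearer. A small sketch using two rows taken from the table above (column names mirror the notebook's `df1`):

```python
import pandas as pd

# Aggregated counts for two days of the week (C_WDAY = 1 and 5), from the table above.
df1 = pd.DataFrame(
    {"fatal": [7173, 9669], "non_fatal": [527826, 661317]},
    index=pd.Index([1, 5], name="C_WDAY"),
)

# Fatality rate: share of collisions on each day involving at least one death.
rate = df1["fatal"] / (df1["fatal"] + df1["non_fatal"])
print(rate.round(4))
```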
C_SEV ~ C_HOUR
pd_data_copia_3 = df_pd_data.copy()
pd_data_copia_3['fatal'] = np.where(pd_data_copia_3['C_SEV'] == 1, 1, 0)
pd_data_copia_3['non_fatal'] = np.where(pd_data_copia_3['C_SEV'] == 2, 1, 0)
df3 = pd_data_copia_3.groupby('C_HOUR')[['fatal', 'non_fatal']].sum()
df3
| C_HOUR | fatal | non_fatal |
|---|---|---|
| 0 | 1626 | 51765 |
| 1 | 1581 | 42682 |
| 2 | 1638 | 43560 |
| 3 | 1387 | 37571 |
| 4 | 950 | 25504 |
| 5 | 1127 | 31306 |
| 6 | 1767 | 79401 |
| 7 | 2154 | 155245 |
| 8 | 2146 | 216652 |
| 9 | 2174 | 161212 |
| 10 | 2401 | 169938 |
| 11 | 2522 | 201761 |
| 12 | 2767 | 244366 |
| 13 | 2900 | 241742 |
| 14 | 3236 | 260953 |
| 15 | 3646 | 322903 |
| 16 | 3732 | 347727 |
| 17 | 3571 | 333387 |
| 18 | 3260 | 238591 |
| 19 | 2771 | 173050 |
| 20 | 2464 | 130770 |
| 21 | 2315 | 122439 |
| 22 | 2195 | 97871 |
| 23 | 1857 | 75248 |
C_SEV ~ C_MNTH
pd_data_copia_4 = df_pd_data.copy()
pd_data_copia_4['fatal'] = np.where(pd_data_copia_4['C_SEV'] == 1, 1, 0)
pd_data_copia_4['non_fatal'] = np.where(pd_data_copia_4['C_SEV'] == 2, 1, 0)
df4 = pd_data_copia_4.groupby('C_MNTH')[['fatal', 'non_fatal']].sum()
df4
| C_MNTH | fatal | non_fatal |
|---|---|---|
| 1 | 4418 | 334069 |
| 2 | 3815 | 283391 |
| 3 | 3730 | 275779 |
| 4 | 3619 | 259648 |
| 5 | 4432 | 310341 |
| 6 | 4978 | 344635 |
| 7 | 5655 | 344103 |
| 8 | 5879 | 350019 |
| 9 | 5335 | 340135 |
| 10 | 5097 | 339481 |
| 11 | 4858 | 327783 |
| 12 | 5041 | 335665 |
C_SEV ~ C_CONF
pd_data_copia_2 = df_pd_data.copy()
pd_data_copia_2['fatal'] = np.where(pd_data_copia_2['C_SEV'] == 1, 1, 0)
pd_data_copia_2['non_fatal'] = np.where(pd_data_copia_2['C_SEV'] == 2, 1, 0)
df2 = pd_data_copia_2.groupby('C_CONF')[['fatal', 'non_fatal']].sum()
df2
| C_CONF | fatal | non_fatal |
|---|---|---|
| 01 | 1253 | 45329 |
| 02 | 2341 | 102642 |
| 03 | 3366 | 98803 |
| 04 | 3931 | 130344 |
| 05 | 177 | 8543 |
| 06 | 6508 | 283613 |
| 21 | 4103 | 1214441 |
| 22 | 1631 | 131433 |
| 23 | 634 | 47984 |
| 24 | 273 | 37728 |
| 25 | 70 | 8347 |
| 31 | 13950 | 117313 |
| 32 | 1096 | 30337 |
| 33 | 2660 | 283174 |
| 34 | 456 | 43288 |
| 35 | 7268 | 592107 |
| 36 | 1751 | 290620 |
| 41 | 914 | 55719 |
| QQ | 3034 | 194880 |
C_SEV ~ C_WTHR
pd_data_copia_5 = df_pd_data.copy()
pd_data_copia_5['fatal'] = np.where(pd_data_copia_5['C_SEV'] == 1, 1, 0)
pd_data_copia_5['non_fatal'] = np.where(pd_data_copia_5['C_SEV'] == 2, 1, 0)
df5 = pd_data_copia_5.groupby('C_WTHR')[['fatal', 'non_fatal']].sum()
df5
| C_WTHR | fatal | non_fatal |
|---|---|---|
| 01 | 37219 | 2673716 |
| 02 | 7358 | 404906 |
| 03 | 4749 | 388288 |
| 04 | 4129 | 229770 |
| 05 | 419 | 19126 |
| 06 | 2057 | 53397 |
| 07 | 318 | 9499 |
| Q | 141 | 9389 |
We draw bar charts to analyze the distribution of the target variable against each of the remaining variables, calling the graf_barras function. For each variable, the chart shows its own distribution on the left and, on the right, the distribution of the target variable for each of its values.
We drop the sine and cosine columns created earlier before producing this plot.
columnas = df_pd_data.drop(["sin_C_HOUR", "cos_C_HOUR", "sin_C_WDAY", "cos_C_WDAY", "cos_C_MNTH", "sin_C_MNTH"], axis=1).columns
for i in list(columnas):
if (df_pd_data[i].dtype==float) & (i!='C_SEV'):
graf_barras(df_pd_data, columna=i, isContinuous=True, var_objetivo='C_SEV')
elif i!='C_SEV':
graf_barras(df_pd_data, columna=i, isContinuous=False, var_objetivo='C_SEV')
These charts show the distribution of each variable, the number of missing values in each, and the behavior of the target variable for each value the variable takes. As can be seen, for most variables, except those related to the collision date, the distribution is concentrated in a few values. We will comment on some of these distributions later. Furthermore, as already shown, for the vast majority of records the target variable takes value 2, corresponding to no fatalities. As a notable exception, and as one would expect, when the variable P_ISEV (medical treatment) takes value 03 (killed immediately), the record corresponds to a collision in which, according to our target variable, at least one fatality occurred (1).
We plot the correlation matrix of the numeric variables.
num_data = df_pd_data.select_dtypes(include=["Int64", "int64","Float64"]).columns
numerico= df_pd_data[num_data]
corr = numerico.corr(method = 'spearman')
plt.figure(figsize = (10, 8))
sns.heatmap(corr.abs(), cmap ='viridis')
plt.show()
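For reference, Spearman correlation (the method used in the heatmap) ranks the values before correlating, so any monotonic relation scores ±1 even when it is not linear. A small self-contained sketch with toy data:

```python
import pandas as pd

# x and y are monotonically related but non-linear (y = x^2 on positive x),
# so their Spearman correlation is 1 while Pearson's would be below 1.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})
print(df.corr(method="spearman"))
```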
Having filtered our dataset, we now select the columns we will work with for the final model.
We have discarded all the columns describing vehicles and persons. That is, every column containing information about the vehicles involved (V ...) and the persons involved (P ...) is excluded from our final dataset. This exclusion is because, as stated in the problem definition, our model will try to predict whether or not there are fatalities at the exact moment a collision occurs.
This means that any information learned after that instant is information we do not have. Until an investigation takes place, data such as the vehicle model year, whether medical treatment was required, or any other descriptive information of that kind, are unknown to us.
At the moment a collision occurs, the only information available concerns what is directly related to the collision itself, such as the roadway configuration or the number of vehicles involved, along with contextual data such as the date or the weather conditions.
Also, having transformed the variables C_MNTH, C_WDAY and C_HOUR into cyclical features, we keep the columns produced by that transformation (sine and cosine) and drop the original columns from our dataset.
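The cyclical (sine/cosine) encoding mentioned above can be sketched as follows; the column name and period here are illustrative assumptions (C_HOUR with a 24-hour period, and analogously 7 for C_WDAY or 12 for C_MNTH):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"C_HOUR": [0, 6, 12, 18]})
period = 24  # hours in a day

# Map each hour onto the unit circle so hour 23 and hour 0 end up close together.
df["sin_C_HOUR"] = np.sin(2 * np.pi * df["C_HOUR"] / period)
df["cos_C_HOUR"] = np.cos(2 * np.pi * df["C_HOUR"] / period)
print(df)
```

This removes the artificial discontinuity a plain integer encoding creates at the wrap-around point.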
list_subset_columns = ['C_YEAR', 'cos_C_MNTH', "sin_C_MNTH", 'cos_C_WDAY', "sin_C_WDAY", "cos_C_HOUR", 'sin_C_HOUR',
'C_SEV', 'C_VEHS', 'C_CONF', 'C_RCFG', 'C_WTHR', 'C_RSUR', 'C_RALN', 'C_TRAF']
df_subset_pd_data = df_pd_data[list_subset_columns].copy()
df_subset_pd_data
| | C_YEAR | cos_C_MNTH | sin_C_MNTH | cos_C_WDAY | sin_C_WDAY | cos_C_HOUR | sin_C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | C_RALN | C_TRAF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | 0.682553 | -0.730836 | 2 | 2 | 34 | NaN | 01 | 05 | 03 | 03 |
| 1 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | 0.682553 | -0.730836 | 2 | 2 | 34 | NaN | 01 | 05 | 03 | 03 |
| 2 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | -0.57668 | 0.81697 | 2 | 1 | 01 | NaN | 05 | 03 | 06 | 18 |
| 3 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | -0.068242 | -0.997669 | 2 | 3 | 01 | 02 | 01 | 01 | ||
| 4 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | -0.068242 | -0.997669 | 2 | 3 | 01 | 02 | 01 | 01 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3922439 | 2014 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN |
| 3922440 | 2014 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN |
| 3922441 | 2014 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN |
| 3922442 | 2014 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN |
| 3922443 | 2014 | <NA> | <NA> | <NA> | <NA> | -0.775711 | -0.631088 | 2 | 1 | 02 | 01 | 01 | 05 | 04 | 18 |
3902114 rows × 15 columns
Normalized distribution
categorical = ['C_CONF', 'C_RCFG', 'C_WTHR', 'C_RSUR', 'C_RALN', 'C_TRAF']
fig, axes = plt.subplots(3, 2, figsize=(20, 20), sharey=False)
fig.suptitle('Categorial Values by Severity (normalized)')
i = 0
j = 0
for cat in categorical:
temp = norm_category(df = df_subset_pd_data, obj_val = 'C_SEV', cat_val = cat)
sns.barplot(data = temp, x = cat, y = 'group%', hue = 'C_SEV', ax = axes[i,j]);
if (j == 0):
j = 1
else:
j = 0
i += 1
Finally, with the discards made and our dataset filtered, we group the values of each variable according to its distribution with respect to the target variable, in order to reduce the number of categories so that the model works more efficiently.
For the variable C_CONF (collision configuration) we grouped the possible values into 4 categories according to the number and direction of the vehicles involved. (See data dictionary)
For the variable C_WTHR (weather) we grouped by distinguishing between weather favorable for driving and adverse weather. As the previous chart shows, most collisions fall into category 01, corresponding to a clear, sunny day. (See data dictionary)
For the column C_RALN (road alignment), as the distribution against the target variable shows, most collisions occur on straight, level roads (01), so we grouped the variable into two categories: straight, level roads on the one hand, and roads with curves, gradients, etc. on the other. (See data dictionary)
For the variable C_RSUR (road surface) we distinguished two groups: one for scenarios where the road is dry and in normal condition, where most collisions occur according to the distribution (01), and a second group for scenarios where the road is wet, snow-covered, etc. (See data dictionary)
For the variable C_TRAF (traffic control) we grouped according to the observed distributions: one group keeps the cases where traffic signals operate completely normally (01), a second covers scenarios with other restrictions such as pedestrian crosswalks, nearby schools, etc., and a third covers the cases with no control of any kind (18 in the original dataset). (See data dictionary)
The last grouping is on the variable C_RCFG (roadway configuration). We distinguished two groups among the existing categories: one for roads without any kind of intersection, and a second for roads or locations with intersections or any other element that alters the roadway: tunnels, bridges, railway crossings, etc. (See data dictionary)
We apply the corresponding groupings to each variable.
df_subset_pd_data["C_CONF"] = df_subset_pd_data["C_CONF"].replace(["01", "02", "03", "04", "05", "06"], "01")
df_subset_pd_data["C_CONF"] = df_subset_pd_data["C_CONF"].replace(["21", "22", "23", "24", "25"], "02")
df_subset_pd_data["C_CONF"] = df_subset_pd_data["C_CONF"].replace(["31", "32", "33", "34", "35", "36"], "03")
df_subset_pd_data["C_CONF"] = df_subset_pd_data["C_CONF"].replace(["41"], "04")
df_subset_pd_data["C_WTHR"] = df_subset_pd_data["C_WTHR"].replace(["02", "03", "04", "05", "06", "07"], "02")
df_subset_pd_data["C_RALN"] = df_subset_pd_data["C_RALN"].replace(["02", "03", "04", "05", "06"], "02")
df_subset_pd_data["C_RSUR"] = df_subset_pd_data["C_RSUR"].replace(["02", "03", "04", "05", "06", "07", "08", "09"], "02")
df_subset_pd_data["C_TRAF"] = df_subset_pd_data["C_TRAF"].replace(["03", "04", "05", "06", "07", "08", "09", "10",
"11", "12", "13", "14", "15", "16", "17"], "02")
df_subset_pd_data["C_TRAF"] = df_subset_pd_data["C_TRAF"].replace(["01", "02"], "01")
df_subset_pd_data["C_TRAF"] = df_subset_pd_data["C_TRAF"].replace(["18"], "03")
df_subset_pd_data["C_RCFG"] = df_subset_pd_data["C_RCFG"].replace(["02", "03", "04", "05", "06", "07", "08", "09", "10"], "02")
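As an aside, the chained .replace() calls above can be collapsed into a single dictionary per column. A minimal sketch for the C_CONF grouping, on a toy series rather than the full dataset:

```python
import pandas as pd

# One mapping dict per column: old category code -> new grouped code.
conf_map = {c: "01" for c in ["01", "02", "03", "04", "05", "06"]}
conf_map.update({c: "02" for c in ["21", "22", "23", "24", "25"]})
conf_map.update({c: "03" for c in ["31", "32", "33", "34", "35", "36"]})
conf_map["41"] = "04"

# Toy series with one value from each of the four groups.
s = pd.Series(["01", "22", "35", "41"])
print(s.replace(conf_map).tolist())
```

A single dict-based replace keeps each grouping self-documenting and avoids rescanning the column once per call.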
We display the unique values remaining in each column. The number of possible categories in each transformed variable has been notably reduced.
valores_unicos(df_subset_pd_data[['C_CONF', 'C_RCFG', 'C_WTHR', 'C_RSUR', 'C_RALN', 'C_TRAF']])
{'C_CONF': ['03', '01', 'QQ', '02', nan, '04'],
'C_RCFG': [nan, 'QQ', '01', '02'],
'C_WTHR': ['01', '02', nan, 'Q'],
'C_RSUR': ['02', '01', nan, 'Q'],
'C_RALN': ['02', '01', nan, 'Q'],
'C_TRAF': ['01', '03', nan, 'QQ']}
We add a random variable in order to assess variable importance more precisely. That is, when interpreting variable importance in our final model, any variable whose importance ranks below this random variable can be considered to contribute no information to our prediction.
df_subset_pd_data["Random"]=np.random.randint(100, size=df_subset_pd_data.shape[0])
Below we show the final dataset on which we will evaluate the models.
df_subset_pd_data
| | C_YEAR | cos_C_MNTH | sin_C_MNTH | cos_C_WDAY | sin_C_WDAY | cos_C_HOUR | sin_C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | C_RALN | C_TRAF | Random |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | 0.682553 | -0.730836 | 2 | 2 | 03 | NaN | 01 | 02 | 02 | 01 | 27 |
| 1 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | 0.682553 | -0.730836 | 2 | 2 | 03 | NaN | 01 | 02 | 02 | 01 | 88 |
| 2 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | -0.57668 | 0.81697 | 2 | 1 | 01 | NaN | 02 | 02 | 02 | 03 | 7 |
| 3 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | -0.068242 | -0.997669 | 2 | 3 | 01 | 02 | 01 | 01 | 77 | ||
| 4 | 1999 | 0.866025 | 0.5 | 0.62349 | 0.781831 | -0.068242 | -0.997669 | 2 | 3 | 01 | 02 | 01 | 01 | 23 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3922439 | 2014 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 7 |
| 3922440 | 2014 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 98 |
| 3922441 | 2014 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 80 |
| 3922442 | 2014 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | NaN | 01 | NaN | NaN | NaN | NaN | 33 |
| 3922443 | 2014 | <NA> | <NA> | <NA> | <NA> | -0.775711 | -0.631088 | 2 | 1 | 01 | 01 | 01 | 02 | 02 | 03 | 16 |
3902114 rows × 16 columns
table = pa.Table.from_pandas(df_subset_pd_data, preserve_index=False)
pq.write_table(table, '/Users/aitor/Desktop/Máster Ciencia de Datos/Aprendizaje automático/Machine_Learning/big_practice_data/pd_data.parquet')